Prosper Marketplace is America’s first peer-to-peer lending marketplace, with over $ 7 billions in funded loans. Borrowers request personal loans on Prosper and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request. Investors can consider borrowers’ credit scores, ratings, and histories and the category of the loan. Prosper handles the s ervicing of the loan and collects and distributes borrower payments and interest back to the loan investors.
A personal loan is an unsecured loan typically from $1,000 - $100,000 with fixed or variable interest rates that can be used to make a large purchase (medical procedure, home improvement, engagement ring, wedding, baby or other major life events) or to consolidate debt such as credit card, for example.
Prosper has a transaction-based business model, in which the company collects revenue by taking a fee on its customers’ transactions. Borrowers who receive a loan, pay an origination fee of 1.00% to 5.00%,depending on the borrower’s Prosper Rating, and investors pay a 1% annual servicing fee.Every loan application is assigned a Prosper Rating—a proprietary rating system that allows for consistency in the evaluation of applicants. Potential investors use Prosper Ratings to help decide whether to commit to invest in your loan listing.
I chose Prosper Loan data to analyze because I am very fascinated with the amounts of personal loans they offer and curious about the market Prosper is interested in. How does Prosper set interest rate? What makes investors want to invest in these short term notes. I wonder what kind of risks and returns associated with these loans? As the borrower’s status changed, such as employment, marital status or spending habits, will the risk of default increase? I will be more curious and post more questions as I go deeper into the analysis.
ProsperLoanData is one of the data set suggested by Udacity for the R Analysis project
*[Available at:]
(https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv)
*[Variable Definitions availble at:] (https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk- Yx3z0XDEtI/edit#gid=0)
=========================================================================================================
Since Prosper claimed that they have originated over $9 billions in consumer loans (to date), I want to see what kind of distributions of the loan counts over the years by LoanOriginationQuarter.
The plot did not come out as I wanted because it’s sorted by quarter and then by year. I have to change the order of the factor levels by specifying the order explicity. After applying this code, ggplot now understands in which order to plot my LoanOriginationQuarter on \(x\) axis.
## Q1 2006 Q1 2007 Q1 2008 Q1 2010 Q1 2011 Q1 2012 Q1 2013 Q1 2014 Q2 2006
## 315 3079 3074 1243 1744 4435 3616 12172 1254
## Q2 2007 Q2 2008 Q2 2009 Q2 2010 Q2 2011 Q2 2012 Q2 2013 Q3 2006 Q3 2007
## 3118 4344 13 1539 2478 5061 7099 1934 2671
## Q3 2008 Q3 2009 Q3 2010 Q3 2011 Q3 2012 Q3 2013 Q4 2005 Q4 2006 Q4 2007
## 3602 585 1270 3093 5632 9180 22 2403 2592
## Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Q4 2013
## 532 1449 1600 3913 4425 14450
=========================================================================================================
From this graph, since the inception in Q4 2005, the number of loans grown nicely per quarter, from 22 to almost 4344 loans until Q4 2008. In Q4 2008, it only had about 532 loans and then almost nothing until Q4 2009. It took 4 years, till Q1 2012, for Prosper to get back to the same (# of loans) level as of Q2 2008.
I wonder what happened and found that on November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper. Due primarily to the novel nature of the peer-to-peer lending models, the SEC, after review, now treats all peer-to-peer lending transactions as sales of securities and requires that all platforms register with the SEC. In addition to this violation, on November 26, 2008, a class action lawsuit was filed against Prosper in CA. I believe these events had negatively impact Prosper’s ability to motivate consumers and investors to borrow and fund the loans respectively.
Univariate Analysis of Prosper Score, Loan Status and Borrower Rate- Annually
Since its SEC registration in 2009, Prosper has provided a proprietary “Prosper Rating” for prospective borrowers based on the company’s estimation of that borrower’s “estimated loss rate.” According to the company, that figure is “determined by two scores: (1) the credit score, obtained from an official credit reporting agency, and (2) the Prosper Score, figured in-house based on the Prosper population.” Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”). Prosper Score, a custom risk score built using historical Prosper data. The score ranges from 1-11, with 11 being the best, or lowest risk score. Applicable for loans originated after July 2009.
Prosper rating Estimated Average Annual Loss Rate AA 0.00–1.99% A 2.00–3.99% B 4.00–5.99% C 6.00–8.99% D 9.00–11.99% E 12.00–14.99% *HR 15.00%+
Part of the data wrangling and tidy to make the graphs work, are to rename some of the variables because it’s has the parenthesis around it. These following codes take care of this problem.
=========================================================================================================
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
=========================================================================================================
The Prosper Borrower’s Score is a normal distribution with the mean = 5.95 and median = 6. There were none at 1 (worse) or at 11 (best).
=======================================================================================================
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
=======================================================================================================
The plot of the Loan Status surprised me in that most of the loans are Current (58,000); Completed is about 38,000 with Charged Off around 11,000 loans. The plot is skewed to the left with long tail on the right for the past due amounts in days. The charged off is about 10% of the total Prosper number of loans, this is a very high risk business but they are probably making it up by imposing such high interest rates and to attrack investors. Let’s look at that next.
=======================================================================================================
## loanData$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
## --------------------------------------------------------
## loanData$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1274 0.1815 0.1935 0.2599 0.4975
## --------------------------------------------------------
## loanData$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
=========================================================================================================
As expected, the interest rate is pretty high and there was a spike at rate 32%.
The mean for borrower’s interest rate is 19.28%. However, it does not look righ since I also saw some above 50%?
In the loan term of 36 months, the max is 49.75%. 49.75% is outliner, I will focus on the majority of loan rates, which are in the range of 6% to 35%. To do this, I will transform the data using log10().
=========================================================================================================
=========================================================================================================
Majority of the loans interest rates are well above 15%, between 15% to 20%.
Next, I want to look at the listing Category to figure out why borrowers are willing to pay for these high interest rates.
The category of the listing that the borrower selected when posting their l isting: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
=========================================================================================================
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.774 3.000 20.000
=========================================================================================================
Most loan’s terms are 36 months or 60 months in duration. Majority of the loans from both terms are used for their Debt Consolidations (= 1), the next highest category is Unknown (= 7), then Home Improvement (=2) and Business (=3). The plots skew to the left, with long tails on the right.
Now, I just want to see what the plots look like on loans with 36 and 60 months terms. To do this, I will create a subset to omit loans with 12 month terms & Category Listing is < 7, which are the majority of the loan listings.
=========================================================================================================
=========================================================================================================
Interestingly, most of the loans were used for debt consolidation. I thought credit card interest rates were high but if borrowers paid for Prosper’s rates instead of paying for their credit cards, how does having a loan with Prosper help them financially?
Let’s look at what their Credit Card Revolving Balance and Prosper Loan amount and its monthly payments. I will plots these following: * a) The Total Revolving Credit Card Balance for the individuals with Prosper Loan. * b) The Prosper Loan amount (who used it for Debt Consolidation) * d) The monthly payments with Prosper, assuming that their Credit Card Balance i s reduced by the loan amounts from Prosper.
=========================================================================================================
=========================================================================================================
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17600 19520 1436000 7604
========================================================================================================= * Majority of borrower’s revolving credit card balances are between 3000 to 23000. The plot is skewed to the left.
Applying the scalexlog10(), the Revolving Credit Balance looks like a normal distribution.
## # A tibble: 2 x 4
## `ListingCategory == 1` loan_amount_mean loan_amount_median n
## <lgl> <dbl> <dbl> <int>
## 1 F 6690 4700 55629
## 2 T 9908 9500 58308
=========================================================================================================
## loanData$ListingCategory == 1: FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.9 172.8 229.1 307.4 2179.0
## --------------------------------------------------------
## loanData$ListingCategory == 1: TRUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 166.5 283.0 313.8 410.0 2252.0
========================================================================================================= * *Most borrowers prefer to have a 36 months loan, over 60 months and 12 months. The plot shows the amount of loans used for debt consolidation is mostly in the range of $10,000 - or under.
Prosper’s interest rates are much higher than the Revolving Credit Card’s rates, however, when consolicated into one low monthly payments, it seems to work for borrowers.
For borrowers who use the Prosper Loan’s money to consolidate their debt: The median of their monthly payment is 283, mean is 313.8 and the max is 2,252.
For the rest, montly payment median is at 172.8, mean at 229.1 and max is 2179. Also, if you notice, there is a break accross the plot where there is no loan provided accross the loan amounts (white line on the top of the plot). It’s related to the 2008 period we talked about earlier.
The plot is skewed to the left, with long tail on the right.
Transform the above plot and focus on majority of loans with payments < 1000 & loan less than 25000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
=========================================================================================================
Majority of the loans with monthly payment of less than $1000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 114.0 271.0 398.3 525.0 14980.0
=========================================================================================================
The open revolving credit card payment plot is skewed to the left with a very long tail to the right. I can’t tell much from looking at this plot.
The plot to the right is a transformation of the original one, using scale_x_log10().
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 114.0 271.0 398.3 525.0 14980.0
=========================================================================================================
This analysis is important to me because it’s satisfied my curiousity of why borrowers pay higher interest rates than their credit cards.
I filter the Listing Category of Debt Consolidation, then select the Revolving Credit Balance, its monthly payments vs. Prosper’s original loan amount and its monthly payments. The summary and plots clearly show that Prosper monthly payments are much more affordable for borrowers, with mean of 272.5 vs. 398 for their Revolving Credit Card.
Let’s look at borrower’s employment status, their stated monthly income and Prosper’s loan amount by their employment status.
## # A tibble: 9 x 4
## EmploymentStatus mean_monthly_Income median_monthly_Income n
## <fct> <dbl> <dbl> <int>
## 1 "" 5165 4083 2255
## 2 Employed 6139 5205 67322
## 3 Full-time 5043 4250 26355
## 4 Not available 4555 3583 5347
## 5 Not employed 197 0 835
## 6 Other 3568 3167 3806
## 7 Part-time 1640 1379 1088
## 8 Retired 2987 2617 795
## 9 Self-employed 6338 4333 6134
##
## Pearson's product-moment correlation
##
## data: loanData$ProsperScore and loanData$BorrowerRate
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6536072 -0.6458311
## sample estimates:
## cor
## -0.6497361
=========================================================================================================
Looking at the summary group by Employment Status and the monthly stated income of borrowers, it seems this market dealt with borrowers in the high risks end of the spectrum.
They provided loans for borrowers with employment status of “not availbale”, “not employed”, “none” and “Other”.
It seems Prosper has very lenient policies as opposed to guidelines imposed by normal baking industry. I believe it’s because Prosper is not at risk but the investors. I want to investigate further on this. Investors attrack to the higher return, of course, that come with higher risks. So, let’s look at the analyses from the Investor’s perspective.
Is there any correlation between Borrower’s rates and Prosper’s ratings
To answer this question, I used the cor.test to test the two mentioned-relationships above.
The results are listed below:
Pearson's product-moment correlation
I think using the geom_jitter, it shows nicely the grouping of Prosper borrower’s scores in relation to the borrower interest rates.
There is a negative correlation between Borrower’s Rates and Prosper’s Score of -.65. The higher the borrower’s rates, the lower Prosper’s ratings (0 to 10).
As it’s clearly shown in the plot and that is evidence in the smooth red line).
With that in mind, I shall look at the Return Rate and Investors next.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
=========================================================================================================
The mean of the rate of return is 9.6% with the max of 28.4%. However, sometimes they also have a negative rate of return too.
If that’s the case, then are all the loans fully funded? Let’s find out.
I am going to create a new data frame call ReturnInvestment and contain the following variables:
Let’s look at the Investors next.
## mean_funded median_funded min_funded max_funded n
## 1 0.9985835 1 0.7 1.0125 113937
## Mean_Investor Meidan_Investor min_nvestors max_nvestors n
## 1 80.47523 44 1 1189 113937
=========================================================================================================
Wow, the result from the summary of Investors code surprises me.
Now I understand why Investors are willing to invest in these loans. They all share the risks and investors don’t have to invest in one loan at a time. They can invest in many loans and with small amounts per loan. Thus their risks are also diversified. They manage their risks like how people manage the stock investment porfolios.
Lets look at sole investors, since I think this is the most risky area and whether it’s worth the risk.
=========================================================================================================
The reason I chose this analysis because as discovered from the previous analysis of Investors, most of them don’t solely invested in one loan but multiple loans with small amounts. Their risks are well diversified over the porfolios.
So by looking at these sole investors, we can see clearly that on the average, their return of investment is below 10% but they are taking a higher risk than if they invested in multiple loans for the same amount of funding.
The return is not higher but contain higher risk.
To conclude this project, I will create scatter plot matrices, by selecting seed and subset to include only variables that I want to show.
========================================================================================================
Since this is such big data, what I want to do is to select 5000 loans with loan amount < 10000 and the listing selection is debt consolitation. Then I subset my loan data and pick only variables I think they have direct correlated to the investors’ rate of return. However, when plotting it using ggpairs, I have so much problems with this single plot. It took more than 12 hours to work on just one. I carefully used the same codes that i learned and went over the lessons more than 10x just to figure out why the plot does not work.
Finally, I get it. The lower left the triangle plot use the group histograms for quantitative pairs and scatter plots for quantitative pairs. The upper triangle, use x and provide the correlation for quantitative quantitive pairs.
To wrap up this project, I want to build the lm model to see if the prediction of the rate of return is closed to what Prosper said on their website. I pick one of the loan on the web site with funding required, that has the following criteria:
I used the lm model to build the variables ProsperScore, ListingCategory, LoanOriginalAmount, BorrowerAPR, MonthlyLoanPayment.
m1 <- lm(EstimatedReturn ~ I(LoanOriginalAmount), data = loanData[loanData\(LoanOriginalAmount <20000 & loanData\)ListingCategory==1 & loanData\(EstimatedReturn >0 & loanData\)ProsperScore > 7,])
m5 <- update(m4, ~ . + ListingCategor)
mtable(m1, m2, m3, m4, m5)
Apply the lm model to Prosper current investment listing, using the criteria listed
Evaluate how well the prediction of return of investment from Prosper based on the lm model.
It did not work as it created the following error. I can’t find anything addressed this issue so I decided to use another approach as seen next.
Warning message: In predict.lm(m5, newdata = This_loan, interval = “prediction”, : prediction from a rank-deficient fit may be misleading
##
## Calls:
## m1: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount), data = loanData[loanData$LoanOriginalAmount <
## 20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn >
## 0 & loanData$ProsperScore > 7, ])
## m2: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR,
## data = loanData[loanData$LoanOriginalAmount < 20000 & loanData$ListingCategory ==
## 1 & loanData$EstimatedReturn > 0 & loanData$ProsperScore >
## 7, ])
## m3: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR +
## EstimatedReturn, data = loanData[loanData$LoanOriginalAmount <
## 20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn >
## 0 & loanData$ProsperScore > 7, ])
## m4: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR +
## EstimatedReturn + ProsperScore, data = loanData[loanData$LoanOriginalAmount <
## 20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn >
## 0 & loanData$ProsperScore > 7, ])
## m5: lm(formula = EstimatedReturn ~ I(LoanOriginalAmount) + BorrowerAPR +
## EstimatedReturn + ProsperScore + ListingCategory, data = loanData[loanData$LoanOriginalAmount <
## 20000 & loanData$ListingCategory == 1 & loanData$EstimatedReturn >
## 0 & loanData$ProsperScore > 7, ])
##
## ====================================================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------------------------------------
## (Intercept) 0.084*** 0.008*** 0.008*** 0.002 0.002
## (0.001) (0.000) (0.000) (0.001) (0.001)
## I(LoanOriginalAmount) -0.000*** 0.000** 0.000** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## BorrowerAPR 0.442*** 0.442*** 0.447*** 0.447***
## (0.002) (0.002) (0.002) (0.002)
## ProsperScore 0.001*** 0.001***
## (0.000) (0.000)
## ----------------------------------------------------------------------------------------------------
## R-squared 0.017 0.780 0.780 0.780 0.780
## adj. R-squared 0.017 0.780 0.780 0.780 0.780
## sigma 0.026 0.012 0.012 0.012 0.012
## F 217.519 21725.060 21725.060 14511.305 14511.305
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 27155.705 36326.200 36326.200 36335.799 36335.799
## Deviance 8.526 1.908 1.908 1.905 1.905
## AIC -54305.411 -72644.399 -72644.399 -72661.599 -72661.599
## BIC -54283.170 -72614.745 -72614.745 -72624.531 -72624.531
## N 12253 12253 12253 12253 12253
## ====================================================================================================
## [1] 0.01247088
## [1] NA
=========================================================================================================
Using the predictive model, the plot resulted with the following error:
In predict.lm(m6, newdata = This_loan, interval = “prediction”, : prediction from a rank-deficient fit may be misleading
This is another challenge I am running into and I could not find anything on the web to address this error. I contacted my mentors but haven’t heard back from him either. So I decided to try another method, because the predictive model may not work anyway.
The standard deviation summary is what I used. The plot is showing the standard deviation summary of the residuals across Estimated Return Rate and we see the residuals variability is not equal.
The investors are attracted to these investments because their risks are very diversified. They also get their investment back on a monthly basis. They could recover their cost first (because payments are mostly on interest first) before the borrowers could default on their loans. Prosper has a very smart business model which help both the Investors and Borrowers at the same time.
=========================================================================================================
I chose 3 plots which represents the point of views from: The Investor, The Borrower and Prosper Company.
Plot 1, Investor’s point of view: which show the Loan Amount (which is the loan investment by Investors) and their rate of return. It shows that this is a risky market. Bigger loan amounts do not mean higher yield and actually it’s the opposite. Investors are doing better staying below $20000 range. The return of investment is about 9% on the average.
Plot 2, Prosper’s point of view: This plot shows the Borrower’s rate of loan against Prosper’s ranking. The better the score, the lower interest rates, which is expected as a norm.
Plot 3, Borrower’s point of view: This plot shows why Borrower’s are willing to pay higher interest rates to Prosper by consolidating their revolving credit card debt. With lower monthly payments, it is more affordable and that they improve on their overall credit scores.
When I chose this project, I did not know what to expect as I do not know anything about Prosper. As I work on it, I discovered things that are unusual, such as there was no activities in the periods of 2008-2009. So I did some research and learned so much about them. With this knowledge, I am able to analyze its loan in a more intelligent ways.
Sometimes I put a hat on as a borrower to understand their view of points, why they are willing to pay for higher rates than their current credit cards.
(This is one of my assumptions as I didn’t know any credit cards with rates > 20%.) And why they think going with Prosper will benefit them?
Sometimes I look at the data from an Investor point of view. With borrowers having such high credit card debts, low credit score, what makes Prosper investment attractive? All of these questions are unveiled when I get deeper into the analysis and actually it surprised me quite a bit. The results are sometime not as I expected to see as already mentioned and explained in each phase of the analysis.
I ran into many challenges with the codes. Most of the time, I can resolve it by looking it up, went back to the class lessons or just try different ways.
It takes times to correct these issues but I also learned new things from it.
For example, the below code always resulted in an error. As it does not recognize the select() function. I checked many times, refreshed, reviewed my notes, review my lectures but nothing worked. Then I google and found this solution, by typing the code ls(“package: MASS”) and added dplyr:: before select(), then it worked fine.
I also had huge problem when running ggpairs at the very end of the project. I spent over 12 hours on that code alone. I know I did not wisely spending my time on the project. I should move on and use another plot to get the project done. However, I could not let go because I wanted to understand why it did not work. I followed the instructions very well, checked for every single character of my codes. I used set.seed(), select sample from data and subset it. It kept saying that my codes are much larger than the 15 ? allowed? But I did it at the end and I am so happy about it.
In summary, I am glad that I chose this data set to work on because first of all, I don’t drink and I don’t know anything about wine. Secondly, there were not too many variables to analyze from the wine project. This Proser Loan is interesting and I think I can spend days to analyze it further. There are so mamy variables that I did not even touch. They are all good variables with information that investors could use to analyze their investment of choice. My curiousity just kicked in as I work on it further and deeper. Sometimes I feel like I am repeating but actually I am not. You must look at it from the broad perspective first and then as you peel the onions, you discover things that you did not expect to see. That’s exactly how I felt working on with this project.
=========================================================================================================